Abstract

In previous work (Olshausen & Field 1996), an algorithm was described for learning linear sparse codes 
which, when trained on natural images, produces a set of basis functions that are spatially localized, 
oriented, and bandpass (i.e., wavelet-like). This note shows how the algorithm may be interpreted within a 
maximum-likelihood framework. Several useful insights emerge from this connection: it makes explicit the 
relation to statistical independence (i.e., factorial coding), it shows a formal relationship to the algorithm 
of Bell and Sejnowski (1995), and it suggests how to adapt parameters that were previously fixed. 
1 Introduction

There has been much interest in recent years in unsupervised learning algorithms for finding efficient
representations of data. Among these are algorithms for sparse or minimum entropy coding (Foldiak 1990;
Zemel 1993; Olshausen & Field 1996; Harpur & Prager 1996), independent component analysis (Comon 1994;
Bell & Sejnowski 1995; Amari et al. 1995; Pearlmutter & Parra 1996), and hierarchical generative modeling
(Dayan 1995; Hinton et al. 1995). One finds common threads among many of these techniques, and this note
is an attempt to tie some of them together. In particular, I will focus on the sparse coding algorithm of
Olshausen and Field (1996) and its relation to maximum-likelihood techniques. As we shall see, forming
this link enables one to see a formal relationship to the independent component analysis algorithm of
Bell and Sejnowski (1995), which, although not originally described in terms of maximum likelihood, may
be understood in this light. I shall also show how the algorithm may be cast in terms of mean-field theory
techniques in order to obtain a lower bound on the log-likelihood, which bears some similarity to the use
of a “recognition distribution” in the Helmholtz machine of Dayan et al. What emerges from this process
is a better understanding of the algorithm and how it may be improved.

2 Learning linear sparse codes 

In the sparse coding learning algorithm of Olshausen and Field (1996), a set of basis functions,
$\phi_i(x)$, is sought such that when an image, $I(x)$, is linearly decomposed via these basis functions,

$$I(x) = \sum_i a_i \phi_i(x), \qquad (1)$$

the resulting coefficient values, $a_i$, are rarely active (non-zero). In other words, the probability
distribution over the $a_i$ should be unimodal and peaked at zero with heavy tails (positive kurtosis).
This is accomplished by constructing an energy function of the form

$$E(I, a|\phi) = \sum_x \Big[ I(x) - \sum_i a_i \phi_i(x) \Big]^2 + \lambda \sum_i S(a_i), \qquad (2)$$

and then minimizing it with respect to the $a_i$ and $\phi_i$. The first term in Equation 2 ensures that
information is preserved (i.e., that the $\phi_i$ span the input space), while the second term incurs a
penalty on activity so as to encourage sparseness. The intuition behind the choice of $S$ is that it
should favor, among activity states with equal variance ($|a|^2$), those with the fewest non-zero
(or not-close-to-zero) components. The choices experimented with include $|a_i|$, $\log(1 + a_i^2)$,
and $-e^{-a_i^2}$.

Gradient descent on $E$ is performed in two phases, one nested inside the other: for each image
presentation, $E$ is minimized with respect to the $a_i$; the $\phi_i$ then evolve by gradient descent on
$E$ averaged over many image presentations. Stated more formally, we seek a set of basis functions,
$\phi^*$, such that

$$\phi^* = \arg\min_\phi \Big\langle \min_a E(I, a|\phi) \Big\rangle \qquad (3)$$

where $\langle \cdot \rangle$ denotes an ensemble average over the images. Note that in this expression and
in the rest that follow, $I$ refers to the vector with components $I(x_j)$, $a$ refers to the vector with
components $a_i$, and $\phi$ refers to the matrix with components $\phi_i(x_j)$.

The intuition behind the algorithm is that on each image presentation, the gradient of $S$ “sparsifies”
the activity on the $a_i$ by differentially reducing the value of low-activity coefficients more than
high-activity coefficients. This weeds out the low-activity units. The $\phi_i$ then learn on the error
induced by this sparsification process. The result is a set of $\phi_i$ that can tolerate sparsification
with minimum mean-square reconstruction error. A virtually identical algorithm was developed
independently by Harpur and Prager (1996).

3 Maximum-likelihood framework

While the energy function framework provides a useful, intuitive way of formulating the sparse coding
problem, a probabilistic approach could provide a more general framework. Harpur and Prager (1996) point
out that the first term on the right-hand side of Equation 2 may be interpreted as the negative
log-likelihood of the image given $\phi$ and $a$ (assuming a gaussian noise model), while the second term
may be interpreted as specifying a particular log-prior on $a$. That is,

$$P(I|a,\phi) = \frac{1}{Z_{\sigma_N}} e^{-\frac{|I - a\phi|^2}{2\sigma_N^2}} \qquad (4)$$

$$P(a) = \prod_i \frac{1}{Z_\beta} e^{-\beta S(a_i)} \qquad (5)$$

with $\lambda = 2\sigma_N^2 \beta$. Thus, we may interpret $E$ as being proportional to
$-\log P(I, a|\phi)$, since

$$P(I, a|\phi) = P(I|a,\phi)\, P(a) \qquad (6)$$

$$\propto e^{-\frac{1}{2\sigma_N^2} E(I, a|\phi)}. \qquad (7)$$
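The energy function of Equation 2, its inner minimization over the coefficients, and its reading as a negative log-probability can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the toy two-pixel "image", the three-element dictionary, the step size, the iteration count, and all function names are hypothetical choices made here, and the outer learning phase for the basis functions is omitted. The cost used is S(a) = log(1 + a^2), one of the choices listed in Section 2.

```python
import math

def sparsity_cost(a):
    """S(a) = log(1 + a^2), one of the sparseness costs listed in Section 2."""
    return math.log(1.0 + a * a)

def sparsity_grad(a):
    """Derivative S'(a) of the cost above."""
    return 2.0 * a / (1.0 + a * a)

def residual(image, a, phi):
    """I(x) - sum_i a_i phi_i(x), with phi[i][x] the i-th basis function at pixel x."""
    return [image[x] - sum(a[i] * phi[i][x] for i in range(len(phi)))
            for x in range(len(image))]

def energy(image, a, phi, lam):
    """Equation 2: squared reconstruction error plus lam * sum_i S(a_i)."""
    err = sum(r * r for r in residual(image, a, phi))
    return err + lam * sum(sparsity_cost(ai) for ai in a)

def infer_coefficients(image, phi, lam, steps=500, lr=0.05):
    """Inner phase: plain gradient descent on E with respect to a, for fixed phi."""
    a = [0.0] * len(phi)
    for _ in range(steps):
        r = residual(image, a, phi)
        a = [a[i] - lr * (-2.0 * sum(r[x] * phi[i][x] for x in range(len(image)))
                          + lam * sparsity_grad(a[i]))
             for i in range(len(phi))]
    return a

def neg_log_joint(image, a, phi, sigma_n, beta):
    """-log P(I, a | phi) from Equations 4 and 5, dropping the normalizing constants."""
    err = sum(r * r for r in residual(image, a, phi))
    return err / (2.0 * sigma_n ** 2) + beta * sum(sparsity_cost(ai) for ai in a)

# Hypothetical overcomplete dictionary: three unit-norm basis functions, two pixels.
phi = [[1.0, 0.0], [0.0, 1.0], [math.sqrt(0.5), math.sqrt(0.5)]]
image = [1.0, 1.0]
sigma_n, beta = 0.5, 1.0
lam = 2.0 * sigma_n ** 2 * beta   # lambda = 2 sigma_N^2 beta

a_hat = infer_coefficients(image, phi, lam)
e_start = energy(image, [0.0, 0.0, 0.0], phi, lam)
e_end = energy(image, a_hat, phi, lam)
```

With `lam` tied to `sigma_n` and `beta` as above, `neg_log_joint` and `energy / (2 * sigma_n ** 2)` agree for any coefficient vector, which is the proportionality the text describes; the descent itself simply lowers `energy` from its value at `a = 0`.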

How can we use this insight to improve our understanding of the algorithm?

Under the maximum-likelihood approach, we would try to find the set of basis functions, $\phi^*$, such that

$$\phi^* = \arg\max_\phi \langle \log P(I|\phi) \rangle \qquad (8)$$

where

$$P(I|\phi) = \int P(I|a,\phi)\, P(a)\, da. \qquad (9)$$

In other words, we are trying to find a set of $\phi_i$ that maximize the log-likelihood that the set of
images could have arisen from a random process in which the $\phi_i$ are linearly mixed with statistically
independent amplitudes distributed according to the prior of Equation 5, with additive gaussian image
noise. This is formally equivalent to minimizing the Kullback-Leibler (KL) distance between the actual
joint probability of the images, $P^*(I)$, and our model of the joint probability based on independent
causes, $P(I|\phi)$, since

$$KL[P^*(I), P(I|\phi)] = \int P^*(I) \log \frac{P^*(I)}{P(I|\phi)}\, dI \qquad (10)$$

$$= -H_{P^*} - \langle \log P(I|\phi) \rangle \qquad (11)$$

and $H_{P^*} = -\int P^* \log P^*$ is fixed, so maximizing $\langle \log P(I|\phi) \rangle$ minimizes KL.

Figure 1: Two-dimensional iso-probability plots of a, Cauchy prior, $P(a)$; b, Gaussian likelihood,
$P(I|a,\phi)$; and c, their product, $P(I,a|\phi) = P(I|a,\phi)\,P(a)$. The axes on each plot are
$a_1$, $a_2$.

Unfortunately, all of this is easier said than done because we have to integrate over the entire set of
$a_i$ in Equation 9, which is computationally intractable. A reasonable approximation may be to assume
that $\sigma_N$ is small, in which case the dominant contribution to the integral is at the maximum of
$P(I, a|\phi)$. Thus,

$$\phi^* = \arg\max_\phi \Big\langle \log\big[\max_a P(I|a,\phi)\, P(a)\big] \Big\rangle. \qquad (12)$$

This is equivalent to the algorithm of Olshausen and Field (1996), as can be seen by comparing to
Equation 3 and using the definitions of Equations 4 and 5. The intuition for why this approximation works
in practice is shown in Figure 1. The prior, $P(a)$, is a product of 1-D “sparse” distributions
(Equation 5), which are unimodal and peaked at zero. The likelihood, $P(I|a,\phi)$, is a multivariate
gaussian, and since we are usually working in the overcomplete case (the number of basis functions
exceeds the dimensionality of the input) this will take the form of a gaussian ridge (or sandwich) that
has its maximum along the line (or plane, etc.) given by $I = a\phi$. The product of these two functions,
$P(I|a,\phi)\,P(a)$, will have its maximum displaced away from the maximum along the gaussian ridge
(i.e., away from the “perfect solution”) and towards the origin, but also towards the ridges of the
prior. Thus, the gradient with respect to $\phi$ will tend to steer the gaussian ridge towards the ridges
of the prior, which will in turn increase the volume of their product, or $P(I|\phi)$. The reason we can
get by with this approximation in this case is that we are working with a product of two fairly smooth,
unimodal functions. If the functions were not so well behaved, then one can see that such an
approximation might produce problems.
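The claim that the integral of Equation 9 is dominated by the neighbourhood of the maximum when the noise is small can be checked numerically in the simplest possible setting. The sketch below is an illustration only: it assumes a single pixel, a single coefficient, a scalar basis value `phi`, and a Cauchy prior (as in Figure 1a); the grid limits, window width, and function names are hypothetical choices made here.

```python
import math

def joint(a, image, phi, sigma_n):
    """P(I|a,phi) P(a) for one pixel and one coefficient: gaussian likelihood, Cauchy prior."""
    likelihood = math.exp(-(image - a * phi) ** 2 / (2.0 * sigma_n ** 2)) / (
        math.sqrt(2.0 * math.pi) * sigma_n)
    prior = 1.0 / (math.pi * (1.0 + a * a))
    return likelihood * prior

def mass_near_peak(image, phi, sigma_n, half_width=0.1, step=0.001, limit=10.0):
    """Return the grid location of the maximum of the joint and the fraction of the
    integral over a that lies within half_width of that maximum."""
    n = int(round(2 * limit / step)) + 1
    grid = [-limit + k * step for k in range(n)]
    values = [joint(a, image, phi, sigma_n) for a in grid]
    peak = grid[values.index(max(values))]
    total = sum(values)
    near = sum(v for a, v in zip(grid, values) if abs(a - peak) <= half_width)
    return peak, near / total

# Same image and basis value, two noise levels.
peak_low_noise, frac_low_noise = mass_near_peak(1.0, 1.0, sigma_n=0.02)
peak_high_noise, frac_high_noise = mass_near_peak(1.0, 1.0, sigma_n=1.0)
```

With low noise nearly all of the mass sits next to the peak, so replacing the integral by its maximum loses little; with high noise the mass is spread out and the peak is pulled further towards the origin by the prior.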

4 Relation to Bell and Sejnowski 

Bell and Sejnowski (1995) describe an algorithm for “independent component analysis” based on maximizing
the mutual information between the inputs and outputs of a neural network. Here, we show that this
algorithm may be understood as solving the same maximum-likelihood problem described above (Section 3),
except by making a different simplifying assumption. This has also been shown recently by Pearlmutter &
Parra (1996) and Mackay (1996).

Bell and Sejnowski examined the case where the number of basis functions is equal to the number of
inputs, and where the $\phi_i$ are linearly independent. In this case, there is a unique set of $a_i$ for
which $|I - a\phi|^2$ equals zero for any given image, $I$. In terms of the previous discussion,
$P(I|a,\phi)$ is now a gaussian hump with a single maximum at $a = I\phi^{-1}$, rather than a gaussian
ridge as in Figure 1b. If we let $\sigma_N$ go to zero in Equation 4, then $P(I|a,\phi)$ becomes like a
delta function and the integral of Equation 9 becomes

$$P(I|\phi) = \int \delta(I - a\phi)\, P(a)\, da \qquad (13)$$

$$= P(I\phi^{-1})\, |\det \phi^{-1}| \qquad (14)$$

so

$$\phi^* = \arg\max_\phi \Big[ \langle \log P(I\phi^{-1}) \rangle + \log|\det \phi^{-1}| \Big] \qquad (15)$$

$$= \arg\min_\phi \Big[ \Big\langle \beta \sum_i S\big((I\phi^{-1})_i\big) \Big\rangle - \log|\det \phi^{-1}| \Big]. \qquad (16)$$

By making the following definitions according to the convention of Bell and Sejnowski (1995),

$$W = \phi^{-1} \qquad (17)$$

$$u_i = W_i \cdot I \qquad (18)$$

the gradient descent learning rule for $W$ becomes

$$\Delta W_{ij} \propto -\beta S'(u_i)\, I_j + \frac{\mathrm{cof}\, W_{ij}}{\det W}. \qquad (19)$$

This is precisely Bell and Sejnowski’s learning rule when the output non-linearity of their network,
$g(x)$, is equal to the cdf (cumulative distribution function) of the prior on the $a_i$, i.e.,

$$y_i = g(u_i) \qquad (20)$$

$$= \int_{-\infty}^{u_i} \frac{1}{Z_\beta} e^{-\beta S(x)}\, dx. \qquad (21)$$

Thus, the independent component analysis algorithm of Bell and Sejnowski (1995) is formally equivalent to
maximum likelihood in the case of no noise and a square system (dimensionality of output =
dimensionality of input). It is easy to generalize this to the case when the number of outputs is less
than the number of inputs, but not the other way around. When the number of outputs is greater than the
effective dimensionality of the input (# of non-zero eigenvalues of the input covariance matrix), then
the extra dimensions of the output will simply drop out. While this does not pose a problem for blind
separation problems where the number of independent sources (dimensionality of $a$) is less than or equal
to the number of mixed signals (dimensionality of $I$), it will become a concern in the representation of
images, where overcompleteness is a desirable feature (Simoncelli et al., 1992).
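The noiseless learning rule for W can be checked against a finite-difference gradient of the per-image log-likelihood, log|det W| minus the prior cost summed over the outputs. The sketch below is a minimal illustration under stated assumptions: a 2x2 system, the cost S(u) = log(1 + u^2), the prior weight written as `beta`, and hypothetical function names and test values chosen here. It verifies the form of the update only; it does not run a full separation experiment.

```python
import math

def s_cost(u):
    """S(u) = log(1 + u^2)."""
    return math.log(1.0 + u * u)

def s_grad(u):
    """S'(u) for the cost above."""
    return 2.0 * u / (1.0 + u * u)

def outputs(W, image):
    """u_i = W_i . I for a 2x2 weight matrix."""
    return [sum(W[i][j] * image[j] for j in range(2)) for i in range(2)]

def log_likelihood(W, image, beta):
    """log|det W| - beta * sum_i S(u_i): the noiseless log-likelihood up to a constant."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    return math.log(abs(det)) - beta * sum(s_cost(u) for u in outputs(W, image))

def weight_update(W, image, beta):
    """-beta S'(u_i) I_j + cof(W_ij) / det W, written out for the 2x2 case."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    cof = [[W[1][1], -W[1][0]], [-W[0][1], W[0][0]]]   # cofactor matrix of W
    u = outputs(W, image)
    return [[-beta * s_grad(u[i]) * image[j] + cof[i][j] / det for j in range(2)]
            for i in range(2)]

def numerical_gradient(W, image, beta, eps=1e-6):
    """Central finite differences of log_likelihood with respect to each W_ij."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            up = [row[:] for row in W]
            down = [row[:] for row in W]
            up[i][j] += eps
            down[i][j] -= eps
            grad[i][j] = (log_likelihood(up, image, beta)
                          - log_likelihood(down, image, beta)) / (2.0 * eps)
    return grad

# Hypothetical weights and a single two-pixel input.
W = [[1.0, 0.3], [-0.2, 0.8]]
image = [0.5, -1.2]
beta = 1.0
analytic = weight_update(W, image, beta)
numeric = numerical_gradient(W, image, beta)
max_gap = max(abs(analytic[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
```

The cofactor term is the gradient of log|det W| (equivalently the transposed inverse of W), so the update is the ascent direction on the log-likelihood, or the descent direction on the objective being minimized.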
5 Lower-bound maximization

A central idea behind the Helmholtz machine of Dayan et al. (1995), as well as the “mean field” theory of
Saul et al. (1996), is the construction of an alternative probability distribution, $Q(a|I)$, that is
used to obtain a lower bound on $\log P(I|\phi)$. First, we rewrite $\log P(I|\phi)$ as

$$\log P(I|\phi) = \log \int Q(a|I)\, \frac{P(I|a,\phi)\, P(a)}{Q(a|I)}\, da. \qquad (22)$$

Then, as long as $Q$ is a probability (i.e., $\int Q = 1$, $Q > 0$), we obtain by Jensen’s inequality

$$\log P(I|\phi) \ge \int Q(a|I) \log \frac{P(I|a,\phi)\, P(a)}{Q(a|I)}\, da \qquad (23)$$

$$= H_Q - \frac{1}{2\sigma_N^2} \langle E(I, a|\phi) \rangle_Q + \mathrm{const} \qquad (24)$$

where $H_Q = -\int Q(a|I) \log Q(a|I)\, da$. Thus, if we can construct $Q(a|I)$ so that the integral is
tractable, then we can do gradient ascent on a lower bound of $\langle \log P(I|\phi) \rangle$. How good
the bound is, though, depends on the Kullback-Leibler distance between $Q(a|I)$ and $P(a|I)$, or in other
words, on how closely we can approximate $P(a|I)$ with our tractable choice of $Q$. Typically, $Q$ is
chosen to be factorial, $Q(a|I) = \prod_i q_i(a_i|I)$, in which case

$$H_Q = -\sum_i \int q_i(a_i) \log q_i(a_i)\, da_i \qquad (25)$$

$$\langle E(I, a|\phi) \rangle_Q = |I - \mu\phi|^2 + \sum_i \sigma_i^2 |\phi_i|^2 + \lambda \sum_i \int q_i(a_i)\, S(a_i)\, da_i \qquad (26)$$

where $\mu_i = \int q_i(a_i)\, a_i\, da_i$ and $\sigma_i^2 = \int q_i(a_i)\, (a_i - \mu_i)^2\, da_i$.

Comparing Equation 26 to Equation 2, one can see that the sparse coding learning algorithm of Olshausen
and Field (1996) effectively uses $q_i(a_i) = \delta(a_i - \mu_i)$, with $\mu_i$ chosen so as to minimize
$E$ (and hence maximize the lower bound of Equation 24). This choice would seem suboptimal, though,
because we are getting zero entropy out of $H_Q$ (actually $H_Q = -\infty$, but we are ignoring the
infinities here because it is the derivatives we really care about). If we could find a $q_i$ with higher
entropy which also lowers the energy, then we could move the bound closer to the true log-likelihood.
However, broadening $q_i$ (for example, by making it gaussian with adjustable $\mu_i$ and $\sigma_i$) only
affects the solution for $\mu$ insofar as it low-pass filters the cost function, $S$, which has a similar
effect to simply lowering $\lambda$. So, it is difficult to see that adding this extra complexity will
improve matters. One apparent benefit of having non-zero $\sigma_i$ is that there is now a
growth-limiting term on the $\phi_i$ (second term on the right side of Eq. 26). Without such a term, the
$\phi_i$ will grow without bound, and so it is necessary in the algorithm of Olshausen and Field (1996)
to keep the $\phi_i$ normalized (which is rather ad hoc by comparison). Preliminary investigation using a
Gaussian $q_i$ and minimizing $E$ with respect to both $\mu_i$ and $\sigma_i$ for each image (but still
keeping the $\phi_i$ normalized) does not reveal significant differences in the solution, but it deserves
further study. It may also be worthwhile to try using a $Q(a|I)$ that is defined by pairwise statistics
(i.e., a covariance matrix on the $a_i$).

It should be noted that what is important here is 
the location of the maximum of whatever approximating 
function we use, not the absolute value of the bound per 
se. If the maximum of the lower-bound occurs at a sig¬ 
nificantly different point than the maximum of the true 
log-likelihood, then the approximation is not much help 
to us. 
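The Jensen bound can be illustrated numerically in one dimension, where the true log-likelihood is still computable by brute-force integration. The sketch below is an illustration only and rests on assumptions made here: one pixel, one coefficient, a scalar basis value, a Cauchy prior, a gaussian Q with hand-picked mean and width, and a fixed integration grid; none of the function names or parameter values come from the original algorithm.

```python
import math

def log_joint(a, image, phi, sigma_n):
    """log[P(I|a,phi) P(a)]: gaussian noise model and a Cauchy prior on a."""
    log_lik = (-(image - a * phi) ** 2 / (2.0 * sigma_n ** 2)
               - 0.5 * math.log(2.0 * math.pi * sigma_n ** 2))
    log_prior = -math.log(math.pi * (1.0 + a * a))
    return log_lik + log_prior

def log_gaussian(a, mu, sigma):
    """Log-density of the approximating distribution Q, here a 1-D gaussian."""
    return -(a - mu) ** 2 / (2.0 * sigma ** 2) - 0.5 * math.log(2.0 * math.pi * sigma ** 2)

GRID_STEP = 0.001
GRID = [-20.0 + k * GRID_STEP for k in range(40001)]

def true_log_likelihood(image, phi, sigma_n):
    """log of the integral of P(I|a,phi) P(a) over a, by a Riemann sum on GRID."""
    return math.log(sum(math.exp(log_joint(a, image, phi, sigma_n)) for a in GRID) * GRID_STEP)

def lower_bound(image, phi, sigma_n, mu, sigma):
    """Integral of Q log[P(I|a,phi) P(a) / Q] over a, by the same Riemann sum."""
    total = 0.0
    for a in GRID:
        log_q = log_gaussian(a, mu, sigma)
        q = math.exp(log_q)
        if q > 0.0:   # skip points where Q underflows to zero
            total += q * (log_joint(a, image, phi, sigma_n) - log_q)
    return total * GRID_STEP

exact = true_log_likelihood(1.0, 1.0, 0.5)
bound_near = lower_bound(1.0, 1.0, 0.5, mu=0.8, sigma=0.4)    # Q placed near the posterior
bound_far = lower_bound(1.0, 1.0, 0.5, mu=-2.0, sigma=0.4)    # Q placed far from it
```

Both bounds lie below the exact value, and the Q placed near the posterior gives the tighter one, in line with the dependence on the Kullback-Leibler distance between Q(a|I) and P(a|I).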

6 Discussion 

What I think has been gained from this process is a better understanding of both the sparse coding
algorithm of Olshausen and Field (1996) and the independent component analysis algorithm of Bell and
Sejnowski (1995). Although neither of these algorithms was originally cast in maximum-likelihood terms,
they are both essentially solving the same problem. The main difference between them is in the
simplifying assumptions they make in order to deal with the intractable integration problem posed by
Equation 9: Olshausen and Field’s algorithm assumes low noise (small $\sigma_N$) and thus a peaky,
unimodal distribution on $P(I, a|\phi)$ in order to justify evaluating it at the maximum, whereas Bell
and Sejnowski limit the dimensionality of the $a_i$ to equal the dimensionality of the input and also
assume no noise so that the integral becomes tractable. The maximum-likelihood framework also makes
possible the link to techniques used in the Helmholtz machine (Dayan et al., 1995), which reveals that a
better choice of approximating distribution, $Q$, could potentially lead to improvements.

A practical advantage of looking at the problem within this framework is that it suggests we could adapt
the shape of the prior. For example, the prior on the $a_i$ need not be i.i.d., but could be shaped
differently for each $a_i$, e.g., $P(a_i) = \frac{1}{Z_{\beta_i}} e^{-\beta_i S(a_i)}$, in order to best
fit the data. Adapting $\beta_i$ would be accomplished by letting it evolve along the gradient of
$\langle \log P(I|\phi) \rangle$. Using the approximation of Equation 12, this yields the learning rule:

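$$\Delta\beta_i \propto -\langle S(a_i) \rangle - \frac{1}{Z_{\beta_i}} \frac{\partial Z_{\beta_i}}{\partial \beta_i} \qquad (27)$$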
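As a worked instance of adapting the prior, take the cost S(a) = |a|, for which the prior is Laplacian and its normalizer is Z = 2/beta, so the gradient of the log-prior with respect to beta is 1/beta minus the average of |a|. The sketch below is an illustration under assumptions made here: the coefficient values are a hypothetical fixed list rather than values re-inferred after each change of beta, and the step size, iteration count, and function names are arbitrary.

```python
def beta_gradient(beta, coefficients):
    """-<S(a)> - d(log Z_beta)/d(beta) for S(a) = |a|, where Z_beta = 2 / beta."""
    mean_cost = sum(abs(a) for a in coefficients) / len(coefficients)
    return -mean_cost + 1.0 / beta

def adapt_beta(coefficients, beta=1.0, lr=0.1, steps=2000):
    """Let beta evolve along the gradient above, holding the coefficients fixed."""
    for _ in range(steps):
        beta += lr * beta_gradient(beta, coefficients)
    return beta

coefficients = [0.1, -0.5, 2.0, 0.0, -0.3, 0.05]   # hypothetical coefficient values
beta_hat = adapt_beta(coefficients)
closed_form = len(coefficients) / sum(abs(a) for a in coefficients)
```

With the coefficients held fixed, the iteration settles where 1/beta equals the mean of |a|, which is `closed_form`. In the full procedure the coefficients would themselves change as beta changes, so this shows only the form of the update, not its behaviour inside the learning loop.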

A problem that may arise here, due to the fact that the full integral in Equation 9 is not being
computed, is that there may be a bias toward non-informative flat priors (since these will yield perfect
reconstruction on each trial). An advantage of Bell and Sejnowski’s algorithm in this case is that it
essentially computes the full integral in Equation 9 and so does not have this problem. For their
algorithm, the maximum-likelihood framework prescribes a method for adapting the “generalized sigmoid”
parameters $p$ and $r$ for shaping the prior (see pp. 1137-8 of their paper), again by doing gradient
ascent on the average log-likelihood. (See also Mackay, 1996, for other methods of parameterizing and
adapting a factorial prior.) In cases where a statistically independent linear code may not be achieved
(e.g., natural images), it may be advantageous to alter the prior so that information about pairwise or
higher-order statistical dependencies among the $a_i$ may be incorporated into our model of $P(a)$, for
example using a Markov random field type model.